The Freedom of the Press Writings

The Trykkefrihedensskrifter, or Freedom of the Press Writings, is a collection of small books (pamphlets) published during the press freedom era in Denmark between 1770 and 1773. Before that period, every book had to be approved by university professors before it could be published. Johann Friedrich Struensee’s reforms made this approval unnecessary, and people started to write and publish their thoughts in the form of small pamphlets. In these pamphlets, the authors covered everything from serious political, philosophical and economic treatises, through political commentary, criticism, satire, essays, fiction and entertainment, to gossip, libel and pornography. Bolle Willum Luxdorph, a Danish government official at the time, collected around 1000 of these pamphlets, which have now been digitized and made accessible via the Danish Royal Library. You can have a look at all the books here.

This was made possible by a book project. My colleague Frederik Stjernfelt wrote a massive book about the pamphlets and this period, called Grov Konfækt, which received a lot of praise in the media.

Why am I interested in these books?

Well, I have a passion for Digital Humanities, and getting insights into a large corpus of Danish pamphlets seemed like a very interesting project. One crucial aspect of the 1000 books is that half of them are of unknown authorship. About a year ago, I made my first attempts to find out who wrote these anonymous books. In academia, the discipline that deals with such authorship attribution problems is usually called stylometry. Back then, I tried an approach called bootstrap consensus networks. I didn’t have much success with it, but the results of these experiments can be found in this DHiNorden 2020 paper.

Why is authorship attribution (AA) in the Freedom of the Press Writings such a difficult problem?

Well, there are three reasons:

  1. The OCR contains a lot of errors, which makes the collection difficult to work with. AA algorithms rely on reasonably clean text in order to detect style. I am still waiting for an improved version of the OCR, which could potentially lead to better AA results.

  2. In principle, this is not a classic AA problem. In a classic AA problem, one would have only a couple of unknown books and a clear set of potential authors. One would then train a machine learning model on books by the known authors, using features that capture their style, and predict which of them most likely wrote the book(s) in question. This is not the case here. We don’t know the authors of around half of these books, and we cannot assume that all potential authors are known either. How we can nevertheless approach this problem is outlined below. By collecting small pieces of evidence, I try to narrow the corpus down to a smaller set of books with known/unknown authors, which would then make a classic AA scenario possible. Let’s start collecting some evidence.

  3. The books cover many different genres and languages. We can find German, French and English books. Some books contain poems, lyrics or only lists of commodities. For now, I have tried to filter out as many of these as possible and to focus only on ‘normal’ text. This reduced the number of books to the 735 presented here.

Step 1: Basic Lexical Concepts and Measurements

As a first step, I want to investigate whether simple lexical measures, such as vocabulary richness (e.g. the type-token ratio) or the share of long words, tell us something about how different our authors’ styles are and how the idiosyncrasies in their writing manifest themselves. Let’s look at some summary statistics of these measures for the ten authors with the most publications in the dataset.

author          num_books  mean_avg_sent_length  mean_book_token_count  mean_book_avg_token_length  mean_book_types  mean_type_token_ratio_book  mean_herdan_c
MartinBrun      54         34.03                 2323.98                4.53                        1079.54          0.50                        0.91
J.C.Bie         16         27.99                 4213.31                4.69                        1778.12          0.47                        0.91
J.L.Bynch       16         35.83                 6113.00                4.68                        2300.19          0.45                        0.90
P.F.Suhm        14         29.44                 7794.50                4.68                        2623.86          0.43                        0.90
SørenRosenlund  14         34.81                 6202.64                4.48                        2155.29          0.39                        0.89
ChristianBagge  11         41.45                 2596.64                4.95                        1147.00          0.48                        0.90
F.C.Scheffer    9          24.23                 3620.22                4.67                        1565.78          0.46                        0.90
Chr.Thura       6          48.51                 11856.83               4.76                        3123.83          0.38                        0.89
L.Jæger         6          83.86                 12752.00               4.71                        3786.00          0.31                        0.88
O.D.Lütken      6          32.53                 10584.33               5.05                        3236.33          0.36                        0.89

Martin Brun wrote 54 pamphlets and is the most represented author. However, his books are rather short. Most of the others, especially L.Jæger, write much longer books. The average token length is supposed to show who tends to use long words, but no real differences become evident. The type-token ratio (TTR) is, as already mentioned, a measure of vocabulary richness. The higher this ratio, the more unique words an author uses, which hints at a rich vocabulary. However, care needs to be taken, as this value is not normalized with respect to text length, which means that longer texts tend to have a lower TTR. In our case, Brun seems to use a richer vocabulary than Thura, Lütken and Jæger, but this is probably due to their texts being longer. A measure of vocabulary richness that is normalized for text length is Herdan’s C, and its values show basically no differences. Finally, the average sentence length differs strongly. Jæger has very long sentences, while Bie and Scheffer have rather short ones. This value is not very reliable, though, because the bad OCR doesn’t reliably capture punctuation marks, which could lead to false sentence detection. To sum up, the lexical measures don’t show strong differences between the most prolific authors, which in turn means that these authors might be difficult to distinguish using such measures as features.
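To make the measures above concrete, here is a minimal sketch of how they could be computed for a single text. The tokenizer is a deliberately naive assumption (lowercasing and splitting on word characters) and not necessarily what was used for the table. The function names are hypothetical, and Herdan’s C is taken to be log(types) / log(tokens).

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive, illustrative tokenizer)."""
    return re.findall(r"\w+", text.lower())


def lexical_measures(text):
    """Return token count, type count, TTR, Herdan's C and average token length."""
    tokens = tokenize(text)
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    if n_tokens < 2:
        # log(1) is 0, so Herdan's C is undefined for texts with fewer than two tokens.
        raise ValueError("need at least two tokens")
    return {
        "tokens": n_tokens,
        "types": n_types,
        # Type-token ratio: unique words divided by all words.
        "ttr": n_types / n_tokens,
        # Herdan's C: the same ratio on a logarithmic scale.
        "herdan_c": math.log(n_types) / math.log(n_tokens),
        "avg_token_length": sum(len(t) for t in tokens) / n_tokens,
    }


# Toy texts: the longer one mostly repeats the vocabulary of the shorter one.
short_text = "the press is free and the people write"
longer_text = short_text + " the press is free and the people write and the people read"

for name, text in [("short", short_text), ("longer", longer_text)]:
    print(name, lexical_measures(text))
```

In this toy example, the TTR of the longer text drops much more sharply than its Herdan’s C, which illustrates why raw TTR values should not be compared across texts of very different lengths.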

Step 2: Burrows’s Delta

For authorship attribution, Burrows suggests using the most frequent word types (MFW), as these very frequent items (which mainly correspond to function words) are used largely unconsciously by an author and are thus suitable for reflecting his or her style. Every author is then represented by a feature vector, or author profile, which can be used to calculate the distance to other authors or to single texts. A single text of unknown authorship that lies very close to an author’s profile could be an indicator that the text was in fact written by that author. To do this, I take the following steps:

  • For known authors, all their texts are pasted together so that author profiles can be created.
  • The features are the 300 MFW in the whole corpus, which can be uni-, bi- or trigrams (depending on their frequency). Each text of unknown authorship and each author profile is then represented by a vector with 300 entries.
  • The values in the vector are not raw frequencies. They are relative to text length and z-standardised, i.e. normalized with respect to the occurrence frequency of that feature in the corpus. This means that each feature has a mean of 0 and a standard deviation of 1 across the corpus.

These vectors can be used to calculate the distance between pamphlets of unknown authorship and author profiles. Moreover, we can check which authors have a similar style, i.e. a small Delta distance between each other. Again, those authors might be difficult to distinguish from one another.
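The steps above can be sketched in a few lines of Python. This is a simplified illustration and not the code behind the results shown here. It uses unigrams only, standardises over the author profiles plus a single unknown text rather than over the whole corpus, and takes Delta as the mean absolute difference between z-scores. All function names and the toy texts are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def most_frequent_words(token_lists, n):
    """Return the n most frequent word types across all token lists."""
    totals = Counter()
    for tokens in token_lists:
        totals.update(tokens)
    return [word for word, _ in totals.most_common(n)]


def relative_frequencies(tokens, vocabulary):
    """Frequency of each vocabulary word relative to the text length."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def z_scores(vectors):
    """Standardise each feature (column) across all vectors (rows)."""
    columns = list(zip(*vectors))
    means = [mean(col) for col in columns]
    # A constant feature has a standard deviation of 0; fall back to 1 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(vec, means, stds)] for vec in vectors]


def delta(z_a, z_b):
    """Mean absolute difference between two z-score vectors."""
    return mean([abs(a - b) for a, b in zip(z_a, z_b)])


def rank_candidates(author_texts, unknown_tokens, n_mfw=300):
    """Rank known authors by their Delta distance to one unknown text."""
    # Author profile: all texts of an author pasted together.
    profiles = {a: [tok for text in texts for tok in text] for a, texts in author_texts.items()}
    names = list(profiles)
    all_tokens = [profiles[a] for a in names] + [unknown_tokens]
    vocabulary = most_frequent_words(all_tokens, n_mfw)
    z = z_scores([relative_frequencies(tokens, vocabulary) for tokens in all_tokens])
    unknown_z = z[-1]
    return sorted((delta(z[i], unknown_z), a) for i, a in enumerate(names))


# Toy data: two invented authors and one unknown text that mirrors AuthorA's word usage.
author_texts = {
    "AuthorA": ["og det er".split(), "og det er en bog".split()],
    "AuthorB": ["men jeg har".split(), "men jeg har en bog".split()],
}
unknown = "og det er en bog og det er".split()

ranking = rank_candidates(author_texts, unknown, n_mfw=5)
for distance, author in ranking:
    print(f"{author}: {distance:.3f}")
```

The smaller the distance, the closer the unknown text is to an author profile. With real data, one would look at the whole ranking rather than only the top candidate, since several profiles can be almost equally close.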

Let’s first have a look at the pamphlet with ID 1.1.10 which, according to the bibliography, is of unknown authorship. Quite a few authors have similar Delta distances to this pamphlet, which makes it difficult to collect evidence about who the real author might be.

When looking at pamphlet 2.13.1, we see that there is actually quite a gap between the closest author, Bie, and the next closest one, Martin Brun. This might suggest that Bie is the real author of that book. However, we would have to study this in more detail.

Which authors might be difficult to distinguish?

We can look at a heatmap of Delta distances between authors. For now, let’s focus on the 20 authors with the most published books. This is an interactive graph, so feel free to explore it in detail. We can see that some authors have a distinct writing style, but this is unfortunately not the case for all of them. In general, the Delta distances here only range from 0.25 to 0.50, whereas distances as large as 1.5 can be observed in the whole corpus. Martin Brun, for example, is close to many known authors.

Step 3: Using UMAP for dimensionality reduction

To collect even more evidence and to sort out which pamphlets are worth investigating further, I want to visualize all author profiles and texts. We can’t just draw a regular scatterplot, as we have not 2 but 300 dimensions to plot. So how can we solve that problem? We use an algorithm that reduces the 300 dimensions of the MFW vectors to 2, which we can then plot. In this case, I used UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction), an algorithm that does something similar to a Principal Component Analysis (PCA) or a Singular Value Decomposition (SVD), namely reducing the number of dimensions, which makes visualisation easier.
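To illustrate the idea of projecting 300-dimensional vectors onto 2 plottable dimensions, here is a minimal sketch. Note that it swaps in PCA (computed via an SVD in NumPy) as a stand-in, because UMAP requires the third-party umap-learn package. The data are random numbers invented for the example, not the pamphlet vectors, and the function name pca_2d is hypothetical.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # The rows of Vt are the principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:2].T


# Invented data: 20 "texts" with 300 features each, forming two artificial groups.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=(10, 300))
group_b = rng.normal(0.0, 1.0, size=(10, 300)) + 1.0
X = np.vstack([group_a, group_b])

coords = pca_2d(X)  # one (x, y) pair per text, ready for a scatterplot
print(coords.shape)

# With umap-learn installed, the projection step could instead look like this:
#   import umap
#   coords = umap.UMAP(n_components=2).fit_transform(X)
```

One design consideration: PCA is a linear projection, whereas UMAP is non-linear and emphasises local neighbourhoods, so distances in the two kinds of plots should not be read in the same way.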

This is again an interactive plot. Feel free to explore all points. You can also hide certain points or make them visible by clicking on the legend on the right. Some points might be hidden, so it is worth hiding the dots without author names to see that Bie and Bynch are quite close and thus possibly similar in style. In fact, according to the heatmap above (Step 2), Bie and Bynch are only 0.39 apart. However, Bie and Martin Brun are also only 0.50 apart, yet Brun lies on the other side of the plot. Moreover, the pamphlet of unknown authorship 2.13.1, which has a low Delta distance to Bie, sits quite far away from him in the UMAP plot. Bynch and Jæger are much closer to this book, although these two authors are not even among the ten closest author profiles in the Delta distance bar chart. How should this be interpreted? I don’t know so far :). Which ‘distance’ is more reliable or, in other words, better reflects an author’s writing style: the metric Delta distance or the visual UMAP distance? By the way, I also tried a PCA, and when plotting the first two dimensions a similar picture emerges.

What’s next?

Did the 300 uni-, bi- and trigram MFW feature vectors capture the style of our authors? Well… to some degree. However, one could certainly explore further options, such as using more features or adding punctuation. The aspects presented here will be the basis for collecting evidence on which pamphlets of unknown authorship can be matched with candidate author profiles for a classic closed-set authorship attribution scenario. In that scenario, I will use various machine learning algorithms and features to perform multinomial text classification. Stay tuned for more info … but for now I will wait until I get my hands on the improved set of OCRed books.